Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

Covering all variants is daunting, so the key goal of variant annotation is usually to prioritize the variants with potential associations to phenotypes of interest. During the last decade, numerous GWASs were conducted to identify genomic variants associated with many complex diseases and traits. However, most studies focused on humans and a few model organisms. Databases for variant mapping and genotype–phenotype association were developed to serve as rich resources for variant annotation. Examples include NHLBI, which contains health information collected by NHLBI’s epidemiological cohorts and clinical trials; dbGaP, the NCBI database of Genotypes and Phenotypes; the Exome Aggregation Consortium (ExAC), which includes sequencing data from a variety of large-scale sequencing projects; and the Catalogue of Somatic Mutations in Cancer (COSMIC). Numerous similar databases were developed for specific diseases such as cancer, autoimmune diseases, and Alzheimer’s disease.

Prioritizing genetic variants relevant to human diseases is the top priority. Guidelines have been developed for investigating variants and their association with human diseases so that such knowledge can be used for diagnosis in a clinical setting. Indeed, after acquiring high-confidence variants, the next step is to annotate and interpret these variants using either prior knowledge or functional prediction based on the impact of the variant on the translated protein. Studies of genetic variants are usually interested in variants that are associated with diseases or traits, or that affect protein function. Variants can have a variety of consequences. A variant may be pathogenic or implicated in a health condition, it may be a damaging variant that alters the normal function of a gene, or it may be a deleterious variant that reduces the fitness of affected individuals. Hence, variant annotation must be conducted after filtering variants as discussed above to avoid misinterpretation, false positives, and false negatives.

Generally, we can define variant annotation as the process of assigning functional or phenotype information to genetic variants such as SNPs, InDels, or copy number variants. Based on this definition, perhaps the most significant variants are those in the coding regions of the genome, because mutations in coding regions may have a direct impact on the protein and may be implicated in a disease. Variants in non-coding regions of the genome may also have an impact, but it is difficult to establish a testable hypothesis for them. Therefore, statistical methods were developed for variant prioritization by incorporating diverse functional evidence, so that variants with small effect sizes but possessing functional features may be prioritized over variants with similar effect sizes that are less likely to be functional.
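
As a rough illustration of how functional evidence can shift a ranking, the sketch below applies simple p-value weighting: each association p-value is divided by a weight derived from a functional score, so that of two variants with the same p-value, the one with stronger functional evidence ranks higher. This is only one possible scheme and not the method of any particular tool. The Variant class, the 0-to-1 functional_score, the floor value, and the variant names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str                # hypothetical identifier
    p_value: float           # association p-value (made-up numbers below)
    functional_score: float  # assumed scale: 0 (no evidence) to 1 (strong)


def prioritize(variants, floor=0.1):
    """Rank variants by p-value divided by a functional weight.

    Weights are proportional to (functional_score + floor) and are
    normalised to average 1, so the functional evidence only
    redistributes priority among the variants being compared.
    """
    if not variants:
        return []
    raw = [v.functional_score + floor for v in variants]
    mean_raw = sum(raw) / len(raw)
    weights = [r / mean_raw for r in raw]
    weighted = [
        (min(1.0, v.p_value / w), v.name)
        for v, w in zip(variants, weights)
    ]
    # Smallest weighted p-value first.
    return [name for _, name in sorted(weighted)]


example = [
    Variant("var_no_function", 1e-4, 0.0),
    Variant("var_functional", 1e-4, 0.9),
    Variant("var_strong_signal", 1e-8, 0.0),
]
ranking = prioritize(example)
print(ranking)
```

In this toy input the two variants with identical p-values are separated by their functional scores, while a variant with a much stronger association signal still ranks first despite having no functional evidence.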

There are numerous variant annotation tools that attempt to associate variants with knowledge-based information and generate reports. The most commonly used tools include SIFT, SnpEff, Annovar, and VEP, which we will use to annotate the variants.
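
To give a feel for what such annotation reports contain, the following minimal sketch reads the ANN entry that SnpEff adds to the INFO column of a VCF record and keeps the annotations whose predicted impact is HIGH or MODERATE. The subfield order follows the ANN annotation format as we understand it, and only the first eleven subfields are named. The INFO string, gene name, and transcript name are invented for illustration, and the helper names parse_ann and high_or_moderate are hypothetical.

```python
# Names for the first 11 pipe-separated ANN subfields (assumed order).
ANN_FIELDS = [
    "allele", "annotation", "impact", "gene_name", "gene_id",
    "feature_type", "feature_id", "transcript_biotype", "rank",
    "hgvs_c", "hgvs_p",
]


def parse_ann(info):
    """Return one dict per annotation found in the ANN entry of an INFO string."""
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            records = []
            # Several annotations for one variant are comma-separated.
            for ann in entry[len("ANN="):].split(","):
                values = ann.split("|")
                # zip stops at the shorter list, so extra subfields are ignored.
                records.append(dict(zip(ANN_FIELDS, values)))
            return records
    return []


def high_or_moderate(records):
    """Keep annotations predicted to have HIGH or MODERATE impact."""
    return [r for r in records if r.get("impact") in {"HIGH", "MODERATE"}]


# Invented INFO column for a single variant (not real data).
info = (
    "DP=35;ANN=A|missense_variant|MODERATE|GENE1|GENE1|transcript|TX1|"
    "protein_coding|2/5|c.100G>A|p.Ala34Thr|100/900|100/600|34/199||"
)
records = parse_ann(info)
selected = high_or_moderate(records)
for r in selected:
    print(r["gene_name"], r["annotation"], r["hgvs_p"])
```

Other tools report their results differently, so a parser like this would need to be adapted to each tool's own output format.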

4.4.1 SIFT

SIFT [11], which stands for Sorting Intolerant from Tolerant, was first introduced in 2001 as an online variant annotation tool that annotates the coding regions of genes with the effects of missense variants on the translated protein. SIFT relies on the assumption that substitutions in conserved regions are more likely to be deleterious if the missense SNV